{
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "# [DIS 7] ReLUs and Neural Network Intuition\n",
                "\n",
                "**If you are running on a local anaconda install, you will need to install pytorch with the command**\n",
                "```sh\n",
                "conda install pytorch -c pytorch\n",
                "```\n",
                "\n",
                "You should immediately run all the cells up through 'Train all layers' since training the networks takes a long time. While you wait you can return to the theory portion of this discussion and work on the backpropagation problem.\n"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "import torch\n",
                "import torch.nn as nn\n",
                "import matplotlib.pyplot as plt\n",
                "import numpy as np\n",
                "import copy\n",
                "import time\n",
                "from ipywidgets import fixed, interactive, widgets\n",
                "\n",
                "from helpers import *\n",
                "\n",
                "%matplotlib inline\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "# Generate Training and Test Data\n",
                "\n",
                "We are again using the piecewise linear function you have seen in past homeworks and discussions. Our training data has added noise $y = f(x) + \\epsilon,\\, \\epsilon \\sim \\mathcal{N}(0, \\sigma^2)$. The test data is noise free.\n",
                "\n",
                "_Once you have gone through the discussion once you may wish to adjust the number of training samples and noise variance to see how gradient descent behaves under the new conditions._\n"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "f_type = 'piecewise_linear'\n",
                "\n",
                "def f_true(X, f_type):\n",
                "    if f_type == 'sin(20x)':\n",
                "        return np.sin(20 * X[:,0])\n",
                "    else:\n",
                "        TenX = 10 * X[:,0]\n",
                "        _ = 12345\n",
                "        return (TenX - np.floor(TenX)) * np.sin(_ * np.ceil(TenX)) - (TenX - np.ceil(TenX)) * np.sin(_ * np.floor(TenX))\n",
                "\n",
                "n_features = 1\n",
                "n_samples = 200\n",
                "sigma = 0.1\n",
                "rng = np.random.RandomState(1)\n",
                "\n",
                "# Generate train data\n",
                "X = np.sort(rng.rand(n_samples, n_features), axis=0)\n",
                "y = f_true(X, f_type) + rng.randn(n_samples) * sigma\n",
                "\n",
                "# Generate NOISELESS test data\n",
                "X_test = np.concatenate([X.copy(), np.expand_dims(np.linspace(0., 1., 1000), axis=1)])\n",
                "X_test = np.sort(X_test, axis=0)\n",
                "y_test = f_true(X_test, f_type)\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "# Define the Neural Networks\n",
                "\n",
                "We will learn the piecewise linear target function using a simple 1-hidden layer neural network with ReLU non-linearity, defined by\n",
                "$$ \\hat{y} = \\mathbf{W}^{(2)} \\Phi \\left( \\mathbf{W}^{(1)} x + \\mathbf{b}^{(1)} \\right) + \\mathbf{b}^{(2)} $$\n",
                "where $\\Phi(x) = ReLU(x)$ and superscripts refer to indices, not the power operator.\n",
                "\n",
                "We will also create two SGD optimizers to allow us to choose whether to train all parameters or only the linear output layer's parameters. Note that we use separate learning rates for the two version of training. There is too much variance in the gradients when training all layers to use a large learning rate, so we have to decrease it.\n",
                "\n",
                "We will modify the default initialization of the biases so that the ReLU elbows are all inside the region we are interested in.\n",
                "\n",
                "We create several versions of this network with varying widths to explore how hidden layer width impacts learning performance.\n",
                "\n",
                "_Once you have gone through the discussion once you may wish to train networks with even larger widths to see how they behave under the three different training paradigms in this notebook._\n"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# Don't rerun this cell after training or you will lose all your work\n",
                "nets_by_size = {}\n"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "widths = [10, 20, 40]\n",
                "for width in widths:\n",
                "    # Define a 1-hidden layer ReLU nonlinearity network\n",
                "    net = nn.Sequential(nn.Linear(1, width),\n",
                "                        nn.ReLU(),\n",
                "                        nn.Linear(width, 1))\n",
                "    loss = nn.MSELoss()\n",
                "    # Get trainable parameters\n",
                "    weights_all = list(net.parameters())\n",
                "    # Get the output weights alone\n",
                "    weights_out = weights_all[2:]\n",
                "    # Adjust initial biases so elbows are in [0,1]\n",
                "    elbows = np.sort(np.random.rand(width))\n",
                "#     print(\"Elbows located at:\")\n",
                "#     print(elbows)\n",
                "    new_biases = -elbows * to_numpy(weights_all[0]).ravel()\n",
                "    weights_all[1].data = to_torch(new_biases)\n",
                "    # Create SGD optimizers for outputs alone and for all weights\n",
                "    lr_out = 0.2\n",
                "    lr_all = 0.02\n",
                "    opt_all = torch.optim.SGD(params=weights_all, lr=lr_all)\n",
                "    opt_out = torch.optim.SGD(params=weights_out, lr=lr_out)\n",
                "    # Save initial state for comparisons\n",
                "    initial_weights = copy.deepcopy(net.state_dict())\n",
                "    # print(\"Initial Weights\", initial_weights)\n",
                "    nets_by_size[width] = {'net': net, 'opt_all': opt_all,\n",
                "                           'opt_out': opt_out, 'init': initial_weights}\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Visualize the starting ReLU corners & slopes, and total function\n",
                "\n",
                "The dashed lines represent the output of an individual ReLU function in the hidden layer for an $x$ input passed through the first linear layer. The gray points mark the elbows of the ReLU functions where their output becomes non-zero.\n"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "for w in widths:\n",
                "    net = nets_by_size[w]['net']\n",
                "    print(\"Width\", w)\n",
                "    plot_update(X, y, X_test, y_test, net)\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "# Train all the networks now - this will take a while!\n",
                "You can expect training to take between 5 and 10 minutes depending on whether you run locally or on datahub and how heavily loaded datahub is at the moment.\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Train only the output layer\n"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "n_steps = 150000\n",
                "save_every = 1000\n",
                "t0 = time.time()\n",
                "for w in widths:\n",
                "    print(\"-\"*40)\n",
                "    print(\"Width\", w)\n",
                "    net = nets_by_size[w]['net']\n",
                "    opt_out = nets_by_size[w]['opt_out']\n",
                "    initial_weights = nets_by_size[w]['init']\n",
                "    history_output = train_network(X, y, X_test, y_test,\n",
                "                            net, optim=opt_out,\n",
                "                            n_steps=n_steps, save_every=save_every,\n",
                "                            initial_weights=initial_weights,\n",
                "                            verbose=False)\n",
                "    nets_by_size[w]['hist_out'] = history_output\n",
                "    print(\"Width\", w)\n",
                "    plot_test_train_errors(history_output)\n",
                "    print(\"Elapsed time %.1f minutes\" % ((time.time() - t0) / 60))\n",
                "t1 = time.time()\n",
                "print(\"-\"*40)\n",
                "print(\"Trained output layer in %.1f minutes\" % ((t1 - t0) / 60))\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Train all layers\n"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "n_steps = 150000\n",
                "save_every = 1000\n",
                "t0 = time.time()\n",
                "for w in widths:\n",
                "    print(\"-\"*40)\n",
                "    print(\"Width\", w)\n",
                "    net = nets_by_size[w]['net']\n",
                "    opt_all = nets_by_size[w]['opt_all']\n",
                "    initial_weights = nets_by_size[w]['init']\n",
                "    history_all = train_network(X, y, X_test, y_test,\n",
                "                            net, optim=opt_all,\n",
                "                            n_steps=n_steps, save_every=save_every,\n",
                "                            initial_weights=initial_weights,\n",
                "                            verbose=False)\n",
                "    nets_by_size[w]['hist_all'] = history_all\n",
                "    print(\"Width\", w)\n",
                "    plot_test_train_errors(history_all)\n",
                "t1 = time.time()\n",
                "print(\"-\"*40)\n",
                "print(\"Trained all layers in %.1f minutes\" % ((t1 - t0) / 60))\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "### Train and Test Error\n",
                "In the train and test error plots, you should observe that, after sufficient training, the test error is consistently lower than the train error. Normally we expect the opposite to happen. **What is causing test error to be lower than train error?**\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "_Your answer..._\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "# Learn the Output Layer Using Ridge Regression\n",
                "\n",
                "We can treat the output of the hidden layer as a featurization of $x$ and use ridge regression to choose weights for the output layer instead of iterating with SGD. We have set up the infrastructure to extract the featurization of the training data and write your learned coefficients back into the network. You will **perform ridge regression with the hidden layer featurization with your choice of regularization parameter $\\lambda$** and compare the result with networks learned by gradient descent.\n",
                "\n",
                "**Learn the last layer weights with ridge regression for each network width and compare the shape of the learned functions, test error, and best regularization coefficients.**\n",
                "\n",
                "You don't need to rigorously choose the regularization coefficient through cross-validation, but you should at least explore a range of values and pick a reasonable value.\n"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "width = 10 # Options are 10, 20, 40\n",
                "# Reinititalize net\n",
                "net = nets_by_size[width]['net']\n",
                "initial_weights = nets_by_size[width]['init']\n",
                "net.load_state_dict(initial_weights)\n",
                "\n",
                "# Featurize x using the hidden layer\n",
                "W1 = initial_weights['0.weight']\n",
                "b1 = initial_weights['0.bias']\n",
                "Xnew = to_torch(X) @ W1.T + b1[None, :]\n",
                "Xnew = to_numpy(nn.functional.relu(Xnew))\n",
                "# Add the bias term\n",
                "tmp = np.ones((Xnew.shape[0], width + 1))\n",
                "tmp[:, :-1] = Xnew\n",
                "Xnew = tmp\n",
                "\n",
                "# TODO: Perform ridge regression on the featurization of x\n",
                "# Store the learned coefficients in a width+1 size array called 'coeffs'\n",
                "### start last_layer_rr ###\n",
                "\n",
                "### end last_layer_rr ###\n",
                "\n",
                "# Set the output layer parameters to the learned coefficients\n",
                "bias = to_torch(np.array(coeffs[-1]))\n",
                "w = to_torch(coeffs[:-1].reshape(1, width))\n",
                "all_params = list(net.parameters())\n",
                "all_params[-1].data = bias\n",
                "all_params[-2].data = w\n",
                "\n",
                "# Test with learned output layer\n",
                "plot_update(X, y, X_test, y_test, net)\n",
                "test_mse = np.mean((to_numpy(net(to_torch(X_test))).ravel() - y_test) ** 2)\n",
                "print(\"Test Error:\", test_mse)\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "_Your description of learned function shape and test error dependence on width..._\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "# Visualize training steps\n",
                "\n",
                "We saved the network state at regular intervals during the training process. We will visualize the evolution of the network through training below. Pay close attention to the movement of the ReLU elbows and slopes and how they influence the overall learned function. Is there a particular feature of the learned function that is directly influenced by the location of the elbows?\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Train Output Layer Weights Only\n",
                "\n",
                "Questions to consider while exploring the training process for last layer weights only:\n",
                "- **How does the hidden layer width impact the learned function and test error?**\n",
                "- **If you could hand-pick the elbow locations, where would you place them?**\n",
                "- **How do the final test error and learned function compare to the ridge regression version of training? Which one is more efficient?**\n"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "width = 10 # Options are 10, 20, 40\n",
                "history_output = nets_by_size[width]['hist_out']\n",
                "net = nets_by_size[width]['net']\n",
                "make_history_interactive(history_output, X, y, X_test, y_test, net)\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Train All Layer Weights\n",
                "\n",
                "Questions to consider while exploring the training process for all layer weights:\n",
                "- **How does the hidden layer width impact the learned function and test error?**\n",
                "- **What happens to the elbow locations during training?**\n",
                "- **How do the final test error and learned function compare to the ridge regression version of training? Which one is more efficient? Your answer may differ for different hidden layer widths.**\n"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "width = 40 # Options are 10, 20, 40\n",
                "history_all = nets_by_size[width]['hist_all']\n",
                "net = nets_by_size[width]['net']\n",
                "make_history_interactive(history_all, X, y, X_test, y_test, net)\n"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": []
        }
    ],
    "metadata": {
        "kernelspec": {
            "display_name": "Python 3",
            "language": "python",
            "name": "python3"
        },
        "language_info": {
            "codemirror_mode": {
                "name": "ipython",
                "version": 3
            },
            "file_extension": ".py",
            "mimetype": "text/x-python",
            "name": "python",
            "nbconvert_exporter": "python",
            "pygments_lexer": "ipython3",
            "version": "3.8.5"
        }
    },
    "nbformat": 4,
    "nbformat_minor": 4
}